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Rev. Lett. 102, 178101 (2009)], we present a complete picture of a Monte Carlo method based on 
Wang-Landau sampling in combination with efficient trial moves (pull, bond-rebridging and pivot 
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most sequences we could also determine the entire energy density of states and, together with 
suitably designed structural observables, explore the thermodynamics and intricate folding behavior 
in the virtually inaccessible low-temperature regime. We analyze the differences between random 
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problems in protein folding. 
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I. INTRODUCTION 

Coarse-grained (lattice-) models play an important 
role in clarifying questions of generic understanding in 
protein folding and protein structure prediction. With 
the aim of separating relevant features from unimportant 
details, such "minimalist" models allow us to capture 
only those forces that effectively drive a system and thus 
to eventually find the "basic laws" behind a particular 
phenomenon. Arguably one of the simplest protein mod- 
els is the hydrophobic-polar (HP) lattice model intro- 
duced by Dill et alj^ Classifying the 20 amino acids in just 
two types (hydrophobic and polar), the HP model con- 
tains only two physical ingredients: an excluded volume 
effect (self-avoiding walk on a lattice) and an effective 
monomer-monomer interaction among non-bonded near- 
est neighbor H monomers, mimicking the hydrophobic in- 
teraction which is considered a main driving force of pro- 
tein folding. Owing to the HP and similar models, many 
fundamental concepts and questions, e. g. the relation- 
ship between protein sequence and structure, the notions 
of energy landscapes and folding funnels, thermodynamic 
transitions towards and stability of the native state, or 
the kinetic mechanisms of folding, could be systemati- 
cally investigated by means of computer simulations ii^i^ 

Minimalist protein models also laid a basis for the 
study of many related problems of biological inter- 
est. Examples include protein aggregation (multi-chain 
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systems),^ surface adsorption^i^ protein folding in het- 
erogeneous (e.g. membranes}^ and crowded or confined 
environments^ or the formation of knots in proteins 

A common requirement for such studies is a generic 
Monte Carlo method capable of efficiently sampling the 
conformational spaces of the system. Despite the formal 
simplicity and minimalistic framework of the HP and re- 
lated models, however, numerical simulations are often 
computationally very demanding. The reliable numerical 
estimation of the low temperature (T) thermodynamic 
behavior of these models is particularly challenging. The 
origin of this difficulty is twofold: (i) Steric constraints 
imposed by the underlying lattice (attrition problem) and 
other conformational or energy barriers lead to rough en- 
ergy landscapes and, thus, hamper the proper sampling 
of conformational space. These issues become particu- 
larly noticeable at low T, where the polymers exhibit 
very compact conformations, (ii) The low-energy confor- 
mations often have a very low degeneracy and the energy 
density of states (DOS), g{E)^ (which measures energy 
degeneracy) increases rapidly with chain length or num- 
ber of energy levels. 

Whereas problem (i) is not unique to protein-like 
models and also appears in simulations of (off-) lattice 
homopolymers,^^ problem (ii) is a direct consequence of 
protein sequence specificity. For example, the HP se- 
quence with 103 monomers (investigated below) has 59 
energy levels and g{E) spans more than 50 orders of 
magnitude, whereas the corresponding interacting self- 
avoiding walk (i. e. homopolymer of the same length with 
H monomers only) contains 139 energies but with a g{E) 
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range of 38 orders of magnitude only. 

Together, these obstacles pose a significant challenge 
to numerical studies and have led to the invention of var- 
ious sophisticated, but often also specialized algorithms, 
such as sequential importance sampling^^ and chain- 
growth methods the most prominent example be- 
ing the pruned-enriched Rosenbluth method (PERM)^ 
and its variants^^^^^ and flat-histogram/multicanonical 
versions, ^^'^^ equi-energy sampling^ multi-self over- 
lap ensemble Monte Carlo, fragment regrowth Monte 
Carlo^ etc. (see also references therein). 

Since finding the lowest energy conformation (s) for a 
given HP sequence is an NP-complete problem^ the 
model has also raised interest in the computer science 
community as a challenge in combinatorial optimiza- 
tion. Targeted specifically to the search of minimal en- 
ergy conformations, several (heuristic) algorithms have 
been proposed, ranging from genetic algorithm o^^i^^ and 
evolutionary Monte Carlo^ or ant colony models^^ to 
constraint-based algorithms However, we stress that 
such approaches, while potentially advancing research in 
protein structure prediction^ are only of limited use to the 
understanding of the thermodynamics/kinetics of protein 
folding. 

One motivation behind this work was the desire to 
develop an algorithm with the power to attack the HP 
model and the fiexibility to be applicable also to more 
general setups as described above. Expanding on pre- 
vious work^ we present here a generic, fully blind and 
fast Monte Carlo procedure, based on the combination of 
Wang-Landau sampling^^ with efficient Monte Carlo trial 
moves, which is very successful both in finding low energy 
conformations and obtaining thermodynamic properties 
for HP-like models over the entire temperature range, 
including the difficult to access low- temperature regime. 
Wang-Landau sampling has been shown to be very ef- 
ficient and robust in various fields of statistical physics 
including simulations of spin systems polymers and 
proteins, ^^'^^ or even numerical integration,—*^ (see also 
references therein). Instead of sampling a system at a 
single temperature, one estimates g{E) from which ther- 
modynamic quantities at any temperature can be de- 
rived. By performing a random walk in energy space, 
the algorithm is well suited for overcoming energy bar- 
riers typically encountered in complex free energy land- 
scapes. However, in order to fully exploit the capabilities 
of the algorithm for systems which also have complex 
conformational spaces, suitably chosen Monte Carlo trial 
moves must be introduced. Finding an optimal inter- 
play between Monte Carlo driver, trial move set and effi- 
cient implementation is a significant achievement of the 
present study. 

The paper is organized as follows: We first provide a 
detailed account of our methodology (Sec. |TT1)- Compar- 
ing with two other successful methods, we then demon- 
strate its efficiency in finding low energy states for com- 
mon benchmark HP sequences (Sec. IHI A|) . In Sec. IIII Bl 
we study the thermodynamics of these model proteins 



and show how to elucidate the subtle conformational 
changes occurring at low temperature by means of suit- 
able structural observables. In Sec. IIII C\ we investigate 
the generic differences in the thermodynamic behavior 
between random and protein-like heteropolymers for long 
chain lengths. Finally, in Sec. II VI gives our conclusion. 



II. MODEL AND SIMULATION METHOD 

HP model proteins consist of isolated self-avoiding 
walks (SAWs) on a regular lattice with each site of the 
walk being occupied by a monomer (either polar or hy- 
drophobic). Here, we only consider the commonly stud- 
ied square (2D) and simple cubic (3D) lattices. Self- 
avoidance means that no lattice site can be occupied by 
more than one monomer at any time. N denotes the 
number of monomers (i.e. the SAW has length N — 1). 
The energy ^ of a protein conformation is defined by the 
number of non-bonded nearest neighbor contacts among 
hydrophobic monomers, each of which being associated 
with an attractive energy — e. 

In the canonical ensemble the partition function is 

Z^(T) = =Y.9{E)e-^'^^, (1) 

where k is Boltzmann's constant and T is the temper- 
ature. The first sum runs over the set of all conforma- 
tions while the second sum over all energies E intro- 
duces the energy density of states, g{E). Since g{E) does 
not depend on T, the second form allows us to calculate 
Zn{T) at any T and is thus the target quantity of our 
interest. 

Despite the conceptual simplicity of the model, esti- 
mating g{E) over the entire - and in particular low - en- 
ergy range of long HP protein sequences is a non-trivial 
task (see the Introduction). Principally, it can be sub- 
divided into three aspects: (A) choice of Monte Carlo 
sampling scheme (Monte Carlo driver); (B) choice of ap- 
propriate and efficient Monte Carlo trial moves; (C) effi- 
cient and fast implementation. 



A. Wang-Landau sampling 

In Wang-Landau sampling, the a priori unknown den- 
sity of states g{E) of energy E is iteratively determined 
by performing a random walk in energy space seeking to 
sample a ffat energy distribution. The expression "ran- 
dom walk in energy space" emanates from the fact that 
conformations are sampled with a probability approach- 
ing ~ l/g{E) resulting in a uniform distribution in E. 
Note that the method can, in principle, be applied to 
any type of observable (not only energy) or to several 
observables simultaneously ] 

Initially, g{E) = 1, V£^. A Monte Carlo trial move from 
a state (or conformation) A with energy Ea to a state B 
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with energy Eb is accepted according to the transition 
probabihty 

''<"-^> ="■'"('• sir)- 

After each trial move, g is modified as g{Et) = / x g{Et) 
(/ > 1 is called the modification factor, initially / = e^) 
and, simultaneously, a histogram H is updated, H{Et) = 
H{Et)-\-l, where t stands for state B if the trial move has 
been accepted, otherwise t = A. Once the energy distri- 
bution in H is sufficiently fiat (i.e. H{E) > p{H{E)) for 
all energies E^ where {H{E)) is the average histogram 
and < p < 1 is called the "fiatness criterion"), the 
modification factor / is reduced as / = ^/f^ H is reset to 
zero and a new iteration begins. This process is repeated 
until / falls below a threshold (/final ^ 1) at which point 
g{E) has converged towards the correct density of states. 

The accuracy of the calculated g{E) is controlled by 
two simulation parameters: the final modification factor 
/final and the fiatness criterion p. For a discussion on the 
choice of these parameters, see below. 

Knowledge of the upper and lower energy boundaries 
as well as any nonexistent energy states (energy gaps) 
within this range, is essential in the WL algorithm in 
order to examine the fiatness of the histogram and, ul- 
timately, to control the course of the simulation. Often, 
however, the exact energy range is a priori unknown, 
(hence the use of ground state search algorithms, e. g. 
for the HP model). To solve this dilemma, performing a 
pre-WL run to determine the energy range and thereafter 
fix the sampling to this range has been proposed j^^i^^ 
The problem with such an approach is two-fold: (i) The 
pre-WL run may take a long time to accurately explore 
energy space and it is thus unclear when to stop, (ii) Of- 
ten, new energy states are found rather late in the course 
of the simulation when the running DOS estimate has 
resulted in a sufficiently "fiat sampling" . 

To overcome these difficulties, the following procedure 
proved to be most efficient :^^ Every time a new energy 
state ^new is found, it is marked as "visited", g{Enew) 
is set to ^min (i-e. the minimum of g among all previ- 
ously visited energy states) and the histogram is reset 
to zero. The ffatness of the histogram is always checked 
among visited energy states only, with a sufficiently long 
interval between subsequent checks (e. g. every 10^ MC 
moves). With this self-adaptive procedure, new regions 
of conformation space can be explored while the current 
estimate of the DOS is further refined which, in turn, 
tends to improved samphng (with respect to a fiat his- 
togram in energy space). 



B. Monte Carlo trial moves 

Compared to traditional Monte Carlo schemes, the 
Wang-Landau algorithm has been an enormous improve- 
ment in overcoming the obstacles typically encountered 



near phase transitions. However, the advantageous dy- 
namics of this Monte Carlo driver is most effective when 
used in concert with suitable Monte Carlo trial moves. 

Usually, studies of lattice polymers introduce local 
moves that change a single bond (end ffips), two bonds 
(corner ffips) or three bonds (crankshafts) However, 
such moves suffer from slow dynamics and are non- 
ergodic.^^ Global moves, like pivot operations, sample 
conformational space of (athermal) self-avoiding walks 
very efficiently^^ but in the low temperature, compact, 
polymer phase their acceptance ratio becomes negligible. 

The key to our approach is the combination of two 
"non-traditional" Monte Carlo trial moves which com- 
plement one another extremely well, namely pull moves^^- 
and bond-rebridging moves, see Fig. [H 

Pull moves: This recently introduced trial move allows 
for the close-fitting motion of a polymer chain within 
a confining environment by continuously "pulling" por- 
tions of the polymer to unoccupied neighboring sites of 
the chain. For a detailed description, see Ref. |40|. Pull 
moves are reversible and ergodic. They feature a good 
balance between local and global conformational changes; 
on average, less than 10 monomers are displaced during 
a pull move, keeping the change in energy per move mod- 
est and thus always guaranteeing a reasonable acceptance 
ratio. Moreover, pull moves provide an efficient means of 
folding and unfolding because of their reptation-like dy- 
namics and the fact that monomers move largely along 
paths that were already occupied; hence, it is less likely 
that a trial move violates the "excluded volume" condi- 
tion. Such features are vital in WL sampling to explore 
both conformation and energy space efficiently. Fig. [1] 
shows some examples of valid pull moves and one invalid 
(non-reversible) end pull move (for completeness, we note 
this case explicitly here since it was not clearly mentioned 
in the original descriptions). 

Bond-rebridging moves: With decreasing temperature, 
polymers form dense and compact conformations leav- 
ing only very few unoccupied lattice sites in the bulk. 
Therefore, monomer displacing trial moves are restricted 
to act on surface beads, and thus become more and more 
ineffective with lower T and larger N. In contrast, bond- 
rebridging (or connectivity altering) moves^ — allow the 
polymer to change its conformation even at highest densi- 
ties by reordering bonds while leaving monomer positions 
unchanged. Moreover, they facilitate long range topolog- 
ical changes, e. g. entanglement, which would otherwise 
require very costly unfolding/folding processes. This lat- 
ter feature also becomes important when the WL sam- 
pling of the DOS is split up into energy subintervals as 
it substantially reduces the risk of "locking-out" confor- 
mational space. Fig. [1] shows the three types of bond- 
rebridging moves applied here; for a detailed discussion, 
see Ref. 41. 

The combination of bond-rebridging and pull moves 
yielded an overall speed-up in WL convergence of up to 
a factor three as compared to pull moves only. More im- 
portantly, ground states of some HP sequences would not 
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FIG. 1. Illustration of MC trial moves. Left panel: Pull moves: (a) single-bead move (kink flip); (b) two-beads move; (c) 
internal multi-beads move; (d) chain-terminal move; (e) chain-terminal move forming a "hook" ; this move is not allowed (non- 
reversible). The dotted circles denote the primary (A) and secondary (B) displacement sites of monomers A and B, respectively. 
Subsequent monomers are then pulled sequentially to previously occupied sites until the chain reaches a new valid configuration. 
Right panel: Bond-rebridging moves: (f) chain internal move ("type 1" in Ref. i41i) with two consecutive cut-and-join steps (in 
the intermediate stage, the chain is divided into a circular and a linear piece); (g) chain internal move ("type 2" in Ref. l4lh 
with a single cut-and-join step; (h) chain-terminal move. Bond cutting and rejoining are marked with a cross and a dashed 
line, respectively. Note that cutting/rejoining steps may alter the HP sequence along the chain (indicated by beads colored 
in gray in the intermediate stages), so once the chain has moved to a new conformation, monomer types must be relabeled to 
ensure the same HP sequence. Double arrows indicate reversibility of moves. All examples have been adopted from Refs. l40l , 
lUandliil. 



have been found at all without the concerted interplay of 
these two types of trial moves. 

To further accelerate global conformational changes, 
we supplemented our trial move set with pivot moves^ 
Although their impact is quite sequence dependent, pivot 
moves further improved the sampling (i. e. WL conver- 
gence time) for most of the HP sequences considered be- 
low. Pivot moves become a prerequisite for chain lengths 
N ^ since the dynamics of pull moves alone would 
be too slow to sample the extended coil conformations of 
long polymers. 

Detailed balance: Wang-Landau sampling is a non- 
Markovian process and its convergence has been shown 
without relying on the condition of detailed balance4^i^ 
Nonetheless, it is important that trial moves fulfill de- 
tailed balance in order to avoid systematic errors. Un- 
like, for instance, single-spin- flip dynamics in the Ising 
model, where the number of trial moves is always con- 
stant, here the number of valid trial moves may vary from 
one conformation to the next. It is therefore necessary 
to account for a possible imbalance when performing a 
Monte Carlo step from a starting state A to a trial state 
B. One possibility to do so, is to enumerate all valid 
trial moves in A and 5, and augment the usual Wang- 
Landau (or Metropolis-Hastings) acceptance rule with 
a term compensating for unequal trial move ratios ^^ii^ 
This enumeration process is, however, computationally 
very expensive. A more efficient Monte Carlo scheme 
that avoids this counting procedure in the case of pull 



moves has been proposed in Ref. |46| Detailed balance is 
guaranteed if a trial move is reversible and the reverse 
move has the same probability of being selected as the 
original move. Therefore, it is possible to carry out a 
"trial-and-error" procedure where trial moves are chosen 
randomly, but with constant probability, i. e. indepen- 
dent of the current conformation. Whenever such a trial 
move is invalid (e. g. resulting in overlapping monomers) 
the move is rejected and g{E)^ H{E) of the old confor- 
mation are updated. Otherwise, the move is accepted 
according to Eq. ([2]). A careful analysis shows that this 
procedure can be applied for all three move types used 
here. So do all of them fulfill reversibility; sometimes 
multiple possibilities exist to go from A to but this 
number is always the same in both directions for a given 
type of move. Thus, it suffices to employ selection rou- 
tines for each move type which ensure that trial moves 
are chosen with constant probability: 

(i) Pull moves: a list is generated prior to simulation, 
specifying the relative displacements of the primary and 
secondary pull sites {A and B monomers in Fig. [1]) for 
the entirely elongated polymer chain. This particular 
conformation features the maximally possible pull moves, 

maxnpuii = {N - 2) x 4.{d - 1) 

+ 2 X (2d - 1 + (2d - 2)^), 

where the first term corresponds to internal pull moves 
and the second term to end pull moves, respectively. No 
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other conformation can have more vahd puh moves than 
this upper bound and ah vahd pull moves of any other 
conformation are contained in this list as a subset. Then, 
a Monte Carlo trial step simply consists in choosing a 
(valid or invalid) pull move at random from this list. 

(ii) Bond-rebridging moves: A random integer i is 
drawn in the interval [0, A/"]. If i = or 1, a chain termi- 
nal rebridging move is selected (on the one or the other 
end of the polymer, respectively), otherwise, a chain in- 
ternal bond rebridging move (type 1 or 2) is attempted 
starting from the bond between monomers i — 1 and i, see 
Fig. [H These trial moves are then carried out as detailed 
in Ref. iH. 

(iii) Pivot moves: The elements (rotations, reflections 
and combinations thereof) of the symmetry group of the 
6Z-dimensional lattice are orthogonal matrices with in- 
teger entries. There are d\ x 2^ such matrices and they 
can be assigned prior to the simulation.^ Thus, to per- 
form a pivot move trial, a symmetry operation is selected 
at random (excluding the identity) and then applied to 
the shorter part of the polymer chain subsequent to a 
randomly set pivot point. 

Finally, the ratios among the three types of moves are 
kept constant. In all our simulations in this work, we 
used fractions of 75%, 23%, 2% for pull, bond-rebridging 
and pivot moves, respectively. These ratios turned out to 
provide a good balance to sample conformational space 
efficiently over the entire energy range. 

By adequately adapting these move fractions, it is pos- 
sible to achieve optimal acceptance ratios in any energy 
range. This can even be automated by introducing en- 
ergy dependent (i.e. variable) move fractions. In this 
case the Wang-Landau acceptance criterion Eq. (j2j) must 
simply be modified to account for unequal fractions be- 
tween the forward and backward (reversible) move in or- 
der to guarantee detailed balance. Although we did not 
observe a significant increase in overall efficiency in our 
WL simulations when using this technique for the HP 
sequences considered here, it might help for very long 
polymer chains. 



C. Efficiency considerations and the choice of / 
and p 

Despite generally higher rejection ratios, the "trial- 
and-error" approach described above outperforms the 
enumeration alternative^ in speed by one to two orders 
of magnitude (depending on chain length and sampled 
energy range) without loss in accuracy. To further accel- 
erate our simulations, additional efficiency mechanisms 
have been implemented: 

Data structures: Two different (redundant) data struc- 
tures have been used to store a protein conformation. 
Monomer positions are (i) stored as d-dimensional coor- 
dinate vectors and (ii) mapped to the sites of a lattice (or 
"bit table") of size L^, L = A'+2, with periodic boundary 
conditions. (L = N-\-2 always ensures unhindered pulling 



by two lattice units for end pull moves.) Whereas the 
former representation facilitates relative monomer dis- 
placements or pivot operations, the latter allows nearest 
neighbor occupancy queries or self- avoidance checks in 
time 0{d). Indeed, these very frequent operations would 
otherwise scale prohibitively as 0((i x N). 

Energy calculation: Pull moves usually displace only 
a small fraction AA" of all monomers; often, this applies 
to pivot and bond-rebridging moves too. It is thus more 
efficient to calculate only the change in energy between 
subsequent Monte Carlo trial moves rather than looping 
over the entire chain to calculate the energy as long as 
A A" < N/A. Together with the usage of a bit table, the 
time for an energy calculation scales then as 0{dx AN). 

Random number generator: For all simulations we used 
the Mersenne Twister (MT19937)^^ generator because 
of its speed. For cross-validation, we used RANLUX^ 
a random number generator of very high statistical 
strength. Both generators are provided in the GNU Sci- 
entific Library (GSL)^^ 

Simulation parameters: To verify our algorithm and 
to understand the influence of the two WL simulation 
parameters (final modification factor /final and flatness 
criterion p) on the accuracy of the results, we performed 
extensive tests for two short HP sequences for which 
the DOS are known exactly by means of enumeration, 
namely a 20mer^^ in 2D and a 14mer^^ in 3D. 

Fig. [2] shows the normalized mean square error 
(NMSE) of our estimates of g{E) for the 20mer and the 
dependence on / and p (results for the 14mer are equiv- 
alent). The NMSE is defined as 

measuring both the systematic (bias term) and statistical 
(variance term) errors, g, g^ and gi stand for the exact, 
averaged (A" = 200 independent runs have been carried 
out for statistical reliability) and individual DOS, respec- 
tively, at a certain energy E. 

Both the systematic reduction of the NMSE by de- 
creasing / and increasing p (Fig. [21 top) as well as the 
very high accuracy achieved over the entire energy range 
for the most stringent parameter choice (In /final = 10""*^^ 
and p = 0.999, Fig. [21 bottom) show clear evidence 
of the correctness of our algorithm and in particular 
of our "trial-and-error" approach of combined usage of 
pull, bond-rebridging and pivot Monte Carlo trial moves. 
Even with In /final = 10~^ and p = 0.8, our DOS esti- 
mates deviate max. a few percent from the exact values 
only. (Note that the bias term is generally one to two 
orders of magnitude smaller than the variance term.) 

Reducing / without increasing p leads to a saturation 
of the NMSE while increasing p without sufficiently de- 
creasing / may result in frozen in systematic errors, see 
the blue curve in Fig. [2] (top). These findings on short 
protein sequences suggest that for p 0.8 no further 
gain in accuracy could be achieved when In/ < 10~^, 
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HH contacts 

FIG. 2. (Color online) Normalized mean square error (NMSE) 
of DOS estimates for an HP 20mer on the square lattice as a 
function of modification factor / and flatness criterion p. Top: 
NMSE of ^(^min = -9) vs In / for several p. Bottom: NMSE 
over the entire energy range for two pairs of In /final and p. 
The solid lines show the entire NMSE (error bars smaller than 
symbol size) and the dotted lines (triangles) show the bias 
terms only. 



a result in agreement with WL studies of other polymer 
models However, for longer sequences we observed that 
the difficult to access low energy states were often found 
only during late iterations and large statistical fluctu- 
ations in the low energy DOS estimates still remained 
at In / ^ 10~^. Only further iterations with reduced / 
could effectively eliminate such fluctuations. Eventually, 
the choice In /final = 10~^ and p = 0.8 turned out to yield 
accurate and reliable DOS estimates over the entire en- 
ergy range at feasible computational costs. This choice of 
parameters requires three to four orders of magnitude less 
Monte Carlo trials until convergence of the DOS as com- 
pared to the most rigid pair chosen here (In /final = 10~^^ 
and j9 = 0.999). 

Final note: The way in which / is decreased affects the 
accuracy of the final DOS and the time of convergence 



of the WL procedure. In particular, it has been found 
that, for constant p, decreasing / exponentially (e. g. 
V7oid ~^ /new) eventually leads to saturation in the error 
of the DOSr^i^i^ see also Fig. [2] (top). Some promis- 
ing modifications have been proposed for alleviating this 
problem, e.g. the 1/t algorithm^^ or dynamic modifica- 
tion rules for / accounting for the fluctuations in the his- 
togram Hj^^ However, no generic "optimal" rule of re- 
ducing / has yet been found without introducing further 
(unknown) system-type and -size dependent simulation 
parameters which need to be adjusted in order to make 
the algorithm effectively converging and outperforming 
the original approach. For instance, the normalization 
factor of Monte Carlo time in the 1/t algorithm is such 
a parameter and an improper choice may even lead to 
non-convergence.^^ The process of finding optimal simu- 
lation parameters, however, can become computationally 
very demanding. On the other hand, our methodological 
experience for lattice polymers has shown that a good 
choice of efficient MC trial moves has a much larger im- 
pact on the overall performance and efficiency of the WL 
procedure than such parameter "tunings". Therefore, in 
this study we kept the standard WL rule of reducing / 
with its proven robustness over a large spectrum of prob- 
lems. 



D. Calculation of thermodynamic and structural 
quantities 

Because the energy density of states g{E) does not de- 
pend on temperature T, an estimate of g{E) allows us 
to calculate the partition function Z{T) (up to a multi- 
plicative constant), see Eq. ([1]), and its derived thermo- 
dynamic quantities at any temperature. For instance, 
the internal energy U{T) and specific heat C{T) are ob- 
tained as 

U{T) = ^ E E9{E)e-^^'^ ^ {E) (5) 

and 

Besides thermodynamic quantities, structural observ- 
ables such as the radius of gyration, etc., are equally 
important to provide insight into the conformational 
changes taking place with varying temperature. Princi- 
pally, such structural observables could be sampled si- 
multaneously during the late iterations of the Wang- 
Landau procedure. Often though, some experimentation 
is required to find appropriate observables; thus, it may 
be more convenient to sample them in a second simula- 
tion step. 

Here, a "Wang-Landau resampling" procedure has 
been employed, where the final g{E) obtained from WL 
samphng is further updated (using a small modification 



7 



factor, In / < 10~^) and the same Monte Carlo accep- 
tance criterion as in Eq. ([2]) is applied. Alongside the 
simulation, structural quantities are measured and the 
sampling stops when all (previously "visited" ) bins of the 
energy histogram H{E) have reached a sufficient number 
of hits (of the order of 10^/10^). The thermodynamic 
average of a quantity Q is then obtained from 

(g)(T) = ^^e-^/^^giEmE). (7) 
Q{E) is the average of Q at energy i. e. 

^^^^ - Ec,HiE,Q) ' 

where H{E,Q) is the two-dimensional histogram in E 
and Q. Because of its advantageous dynamics, this WL 
resampling procedure turned out to cover conformational 
space more uniformly and faster (including conforma- 
tional regions of low energy) than with standard mul- 
ticanonical sampling However, an accurate estimate of 
g{E) is essential for this procedure to yield reliable re- 
sults. 



III. RESULTS AND DISCUSSION 
A. Ground state search 

Various benchmark HP sequences have been designed, 
either as simplified counterparts to real proteins (e.g. 
a 103mer for Cytochrome C),^^ or just for the purpose 
of algorithm testing j^^i^^ The longest sequence explored 
systematically so far, contains 136 residues. 

First, we applied our method to a series of ten 48mers 
(3D)^ which has been used extensively for testing of al- 
gorithmic efficiency, see Table HI For each HP sequence 
we measured the lowest energy found (£^min), the time 
between independent hits of a ground state (thit, in this 
study defined as the time of a "round trip", where a 
conformation with zero energy must have been visited at 
least once between two consecutive hits of a ground state) 
and the time until convergence of the density of states 
over the entire energy interval [£^min,0] (^dos)- For sta- 
tistical significance, WLS timings depict the interquartile 
mean of 20 independent WL runs. 

Our findings were compared with two other methods 
for which timing data are available: (i) an improved 
variant of the pruned-enriched Rosenbluth method 
(nPERMis)^^ and (ii) fragment regrowth Monte Carlo 
via energy-guided sequential sampling (PRESS) Al- 
though it is clear that a comparison of timings is al- 
ways limited because of the different implementations 
and CPU speeds, it serves, nonetheless, as a good in- 
dicator of performance, in particular, when considered 
among several HP sequences. 



TABLE L Comparison of performance for different methods 
on a series of ten HP sequences with chain lengths A" = 48 in 
3D.— ^min denotes the minimum energy reported by all meth- 
ods. Columns 3-5 give the times (in minutes) between inde- 
pendent hits of ^min (^hit)- The last column depicts the WL 
convergence times of the DOS in the energy interval [^min, 0] 
with In /final = 10~^ and p = 0.8. For details of nPERMis 
and PRESS, see Refs. [H andlH, respectively; for nPERMis, 
we only list the timings for the non-biased version. 

Seq. ^nain WLS^ uPERMis^ FRESS^ tDos(WLS) 



1 


-32 


0.32 


0.63 


0.72 


23.1 


2 


-34 


0.84 


3.89 


0.88 


54.9 


3 


-34 


0.68 


1.99 


0.77 


28.3 


4 


-33 


0.59 


13.45 


0.53 


25.5 


5 


-32 


0.23 


5.08 


0.72 


15.5 


6 


-32 


0.39 


6.60 


0.68 


21.3 


7 


-32 


1.58 


5.37 


1.12 


49.5 


8 


-31 


0.58 


2.17 


0.80 


37.4 


9 


-34 


3.10 


41.41 


0.73 


63.2 


10 


-33 


0.98 


0.47 


0.73 


34.8 



^ 2.8 GHz AMD Opteron 2439 
^ 167 MHz Sun ULTRA I 
^ 1.4 GHz PC 



nPERMis is a thoroughly tested, "pure" chain-growth 
algorithm that has been very successful on many HP se- 
quences. It is a generic Monte Carlo sampling scheme 
able to generate the correct Boltzmann weights. (An- 
other improved PERM variant, nPERMh,^^ has been ex- 
clusively designed for conformational search and is, there- 
fore, excluded here; for the flat histogr am/mult icanonical 
PERM versions ^iSiSS only sparse data are available and 
unfortunately no timings have been reported.) 

In PRESS, Monte Carlo trial moves consist of cut- 
ting out and regrowing pieces of variable length along 
the protein chain. This update scheme yields a very effi- 
cient conformational search strategy and, so far, PRESS 
has been the only method capable of finding putative 
minimal energies for all commonly studied benchmark 
HP sequences. However, although principally usable for 
Boltzmann sampling, it has achieved these results with 
methodological tuning ("shortcuts") only, e.g. usage of 
simulated annealing, neglect of detailed balance, or adop- 
tion of a depth-ffist-search strategy to regrow chain frag- 
ments. 

Table |I] shows that for these still relatively short se- 
quences, the three methods perform consistently well. 
They all find the same putative ground states and the 
timings between independent hits are comparable (al- 
though the timings for nPERMis fluctuate considerably 
more than for WLS and PRESS). 

Another series of ten 64mers (3D)^^ showed a similar 
picture (details not shown here). We found the same 
putative ground state energies as nPERM^^ and PRESS. 
Our timings ranged from a few seconds up to around 
one hour, those for nPERM from a few seconds up to 8 
hours (on a 2 GHz PC); no timings have been reported 
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TABLE II. Comparison of performance for different methods 
on benchmark HP sequences with N > 50 on square (2D) and 
simple cubic (3D) lattices. For each sequence, the original ref- 
erence is cited. Columns 3-5 give the times (in hours, except 
otherwise stated) between independent hits of the respective 
energy minima ^min (thit)- The last column gives the WL 
convergence times of the DOS in the energy interval [^min, 0] 
with In /final = 10~^ and p = 0.8. For details of nPERMis 
and FRESS, see Refs. ITsI and [23l, respectively. NA means no 
data available. 



Sequence Emin WLS^ nPERMis^ FRESS^ tDos(WLS) 



2D64^^ 


-42 


< 10s 


30 


< 0.33 


0.28 


2D85^ 


-53 


0.03 


0.17 


< 0.33 


1.06 


2D100a^ 


-48 


0.07 


0.04 


< 0.33 


1.51 


2D100b^ 


-50 


0.17 


5.8 


< 0.33 


2.49 


3D5S^ 


-44 


0.12 


0.19 


0.09 


2.23 


3D64^ 


-56 


0.09 


0.45 


0.53 


2.61 


3D67^ 


-56 


0.99 


1.12 


1.41 


13.80 


3D88i^ 


-69 


0.45 


NA 




8.30 




-72 


8.92 




5.03 


122.86 


3D103^ 


-54 


0.01 


3.12 




0.65 




-57 


0.93 




4.47 


9.79 




-58 


NA 






NA 


3D124^ 


-71 


0.06 


12.3 




2.28 




-75 


^ 104 




< 14d 


> lOd 


3D136^ 


-80 


2.98 


110 




34.17 




-82 


^ 89 




6.42 


> lOd 




-83 


> lOd 




< 14d 


NA 



^ 2.8 GHz AMD Opteron 2439 

^ 667 MHz DEC ALPHA 21264 (seq. 2D85: 167 MHz Sun 

ULTRA I) 
^ 1.4 GHz PC 



for FRESS. 

As a more stringent test we have selected particu- 
lar benchmark HP sequences with TV > 50 in two and 
three dimensions (for a listing of these sequences, see e. g. 
Ref . [23[) . Table ini summarizes the results (WLS timings 
were obtained as above). Overall, the performance of 
an algorithm is now more sequence dependent but two 
tendencies are most striking: In 2D, nPERMis shows 
large timing differences dependent on the HP sequence 
whereas WLS and FRESS perform much more homoge- 
neously. For longer sequences in 3D {N > 88), nPERMis 
performs considerably worse than the other two methods, 
both in terms of energy minima found as well as in corre- 
sponding timings. Difficulties in finding the ground state 
can sometimes be traced back to certain characteristics 
of the native structure, e. g. the formation of a very dis- 
tinct hydrophobic core (3D88) or chain terminals deeply 
buried in the interior of the structure (2D64);^^ see the 
examples of ground state structures for each sequence of 
Table M in Figs. H [5] and [6l 

Often though, the causes are less apparent and look- 
ing at a few single ground state structures is insufficient. 
Instead, it is necessary to consider the entire ensemble. 
For instance, the native states for the two lOOmers in 
2D (2D100a and 2D100b, see Fig. g]), look very similar 




20 40 60 80 



FIG. 3. (Color online) Contact map density of ground states 
for the sequence 2D 100a (lower triangle, Emin = —48) and 
2D 100b (upper triangle Emin = —50), respectively. For 
each sequence, densities have been calculated from a sam- 
ple of more than 10,000 contact maps of ground states con- 
tributed from 20 independent WL production runs. Only non- 
vanishing HH contact densities are shown. 



(as are their minimal energies) and do not feature any 
peculiarities. Nonetheless, nPERMis shows timing dif- 
ferences of two orders of magnitude between the two se- 
quences whereas WLS and FRESS perform equally well 
in both cases. In order to better understand this be- 
havior, we have calculated the contact map densities of 
the ensembles of respective ground states for both se- 
quences, see Fig.O The figure illustrates clearly that na- 
tive structures in case of sequence 2D 100a are rather ho- 
mogeneously distributed in conformational space whereas 
for sequence 2D 100b they are concentrated in fewer but 
denser locations manifesting a lower degeneracy. As the 
timings in Table ITTl indicate . this difference does not seem 
to impact the sampling abilities of FRESS and WLS. Ow- 
ing to their efficient Monte Carlo update schemes, both 
algorithms can easily traverse between distant regions of 
conformational space. 

In contrast, nPERMis needs much more time to ex- 
plore different portions of conformational space, once 
growth proceeds far in one direction (despite sophisti- 
cated pruning/enrichment mechanisms). This problem 
becomes more severe for longer sequences and denser 
sampled conformational spaces. Therefore, we conjecture 
that this is also the main cause of troubles for the longer 
sequences in 3D. Even though stronger heuristic^i^i^ 
or flat-histogram/multicanonical approaches^^*^ may 
partially alleviate these difficulties, the problem remains 
inherent in any static chain-growth algorithm (i. e. one 
that always grows from the same starting point) when 
sampling the generally dense conformational spaces at 
low temperature. The apparent tendencies exposed by 
this comparison of methods in Table |TT1 clearly contra- 
dict the statements in Ref. HO where it has been argued 
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that chain-growth algorithms perform better than Monte 
Carlo algorithms employing trial moves in simulating lat- 
tice proteins at low temperatures. This contradiction 
stems from the fact that the latter methods, such as 
Wang-Landau sampling, strongly depend on the chosen 
trial move set; a point likely not given enough attention 
in Ref. [20|. 

With our strategy, we were able to find all currently 
known putative ground state energies for these bench- 
mark HP sequences, inclusive a conformation with energy 
—58 for the sequence 3D103. This result has also been ob- 
tained by constrained-based protein structure prediction 
(CPSP)^ However, CPSP is not a "blind" search algo- 
rithm but rather based on the threading of a sequence 
through a set of pre-calculated compact hydrophobic 
cores. Thus, it can also not yield thermodynamic prop- 
erties and is not applicable to two-dimensional systems 
where native structures do not necessarily form compact 
H-cores. To our knowledge, our procedure is the only 
generic and fully blind Monte Carlo sampling scheme 
achieving these results so far. 

Sequence 3D 103 demonstrates particularly well the al- 
gorithmic improvements (not just increase of CPU speed) 
that have been achieved in the course of time. In 1994, 
Lattman et al,^^ proposed a first ground state energy 
£^ = — 44 by means of hydrophobic zipper. The sequence 
then served as a testing ground for various methods, 
ranging from contact interaction^ {E = —49), PERM 
variants^^'^o _ FRESS^^ {E = -57) 

to WLS^^ {E = —58). Due to insufficient sampling of 
the ground state we could not obtain an estimate of the 
relative DOS between the ground and first excited states, 
but we certainly now know that the probability of finding 
a conformation with energy £^ = — 58 is definitely more 
than 20 orders of magnitude smaller than for a confor- 
mation with E — —44. 



B. Thermodynamic properties 

With regard to conformational search, the robustness 
{Emin^ found) and performance (timings) of PRESS and 
our approach are very much alike with some sequence- 
dependent differences. The main advantage of our ap- 
proach over PRESS, however, lies in the usage of a sim- 
pler, but equally efficient, Monte Carlo trial move set 
which can be readily combined with Wang-Landau sam- 
pling. This strategy allowed us not only to find putative 
ground states but, at the same time, to obtain estimates 
of the DOS with very high accuracy (In /final = 10~^ 
and p = 0.8). The DOS gives access to thermodynamic 
quantities over the whole temperature range and, in par- 
ticular, enables us to better explore the intricate folding 
behavior of these model proteins at low temperatures. 

Figs, m [5] and [6] show the specific heats, C (see Eq. [6j), 
as a function of temperature (T) for the HP sequences of 
Table HIl In free space (as studied here), the C curves of 
most sequences exhibit two major transitions: 



(i) At high T, a wide peak in C indicates the collapse 
of the protein from a swollen coil to a compact globular 
(pseudo-) phase; see also the snapshots of structures at 
kT/e = 1.0 and kT/e = 0.62 in Fig. El (Note that the 
term "phase" is actually not strictly defined as we are 
considering systems of finite size only.) Since the struc- 
tures in both phases are still of a random nature, the 
location of this transition (Tc of specific heat maximum) 
shows little sequence dependency (at least for these still 
relative short protein-like sequences, see also Sec. IHI C]) . 
It depends, however, on the fraction of H monomers in 
the sequence. This explains, for instance, the very similar 
transition temperatures of sequences 2D100a (Tc ~ 0.49, 
55% H) and 2D100b (T^ ^ 0.48, 56% H) and the higher 
one for sequence 2D85 (T^ ^ 0.65, 69.4% H). A higher 
fraction of H residues raises the collapse transition tem- 
perature because of the generally larger number of HH 
contacts which need to be broken upon going from a glob- 
ular to a coil structure; see also the discussion in Ref. [2l|. 

(ii) At low T, the specific heat signals another tran- 
sition which manifests the folding of the protein from 
a random globule to its native state(s). Shape, magni- 
tude and location of this transition are all very sequence 
dependent; the curves of C range from a barely visible 
shoulder (2D100a, Fig. 2]) to a very sharp peak with a 
clear first-order-like bimodal energy distribution (3D67, 
see also the inset figure and the typical structures at the 
lower two temperatures in Fig.[5j). Low ground state de- 
generacy or energy gaps may give some indication about 
the nature of this transition. But ultimately, it is only 
the entire low energy range of the DOS which determines 
the exact character of this transition. Therefore, accurate 
sampling of the low energy range is compulsory to gain 
a conclusive picture of the low temperature folding be- 
havior. Otherwise, important features of some sequences 
do not show up at all (e. g. the sharp peak in the specific 
heat of sequence 3D88 in Fig. [6l indicating a pronounced 
folding transition which leaves the coil-globule transition 
as only a flat shoulder remnant). 

None of the HP sequences studied here has an energy 
gap between ground and excited states; such a gap has 
been considered a prerequisite of a good folder (i. e. hav- 
ing a pronounced global energy minimum).^ However, the 
longest two sequences show very sharp drops (several or- 
ders of magnitude) in the DOS between first excited and 
ground state (3D 124) or between the lowest excited states 
(3D136). These "kinks" in the DOS cause the sharp, low 
T specific heat peaks. For sequence 3D124, this is the 
only signal in the low temperature regime (no folding 
transition). The folding transition of sequence 3D136 is 
located near kT/e ~ 0.35; below this T, two excitation 
transitions occur (a weak and a very sharp near T ^ 0). 
Clearly, the locations and amplitudes of these very low 
T peaks in C are somewhat uncertain. 

Beside the specific heat, structural quantities have 
been calculated to get further insight into the folding 
process. For instance, the root mean squared radius of 
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FIG. 4. (Color online) Specific heat, C/N (black curves, left ordinates), and canonical ensemble averages of the root mean 
squared radius of gyration, Rg/N (red curves, outer right ordinates), and tortuosity, r (blue curves, inner right ordinates), as a 
function of temperature (abscissas: kT /e) for several benchmark HP sequences on the square lattice (2D). For each sequence, C 
has been obtained from the average DOS of 20 independent WL simulations in the corresponding energy interval [^min, 0], and 
Rg^ T from 20 independent WL resampling production runs. Errors have been estimated by means of a bootstrap analysis^ 
with 200 resamples and are shown where visible only. For each sequence, a typical ground state conformation is shown on the 
left with red and light-gray beads indicating hydrophobic (H) and polar (P) residues, respectively. For further explanations, 
see text. 



gyration, Rg, 

(i ^ V" 

{ri is the position vector of the i^^ monomer and 
Vcm is the center of mass), is known to be a well 
suited observable to signal the collapse of homo- and 
heteropolymersJ^i2£ Our figures (red curves in Figs. IH 
[5] and [6]) confirm this observation and show a clear coin- 
cidence between the upper maximum in C and the strong 
decay of Rg. Due to finite size effects, the steepest slope 
of Rg is slightly shifted towards higher T as compared to 
the specific heat maximum (Tc). On the other hand, Rg 
shows little or no signal in the low temperature regime. 
The polymer end-to-end distance. Re (not shown here), 
has very similar properties as Rg and shows the same S- 
shaped curve indicating the collapse. At low T, R^ may 
have a minimum and rise again as T ^ 0. This effect 
is more pronounced for HP sequences with P terminals 
signaling the tendency to push P monomers towards the 
surface in order to maximize the formation of a compact 
hydrophobic core.^^ However, measuring the spatial ex- 
tent of a polymer only, Rg^ R^^ or similar properties like 
e. g. box size or surface P number^ are insensitive to the 
sequence specific internal conformational (and topologi- 



cal) changes taking place upon folding from the globular 
denatured states to the ground state. 

Measures of conformational "distance" to the native 
state, based on an adjacency matrix^^ or the number of 
native contacts (e. g. the Jaccard index) may better 
serve as a reaction coordinate. However, for sufficiently 
long chains, calculations of such measures can become 
computationally intractable and they may also bear some 
ambiguity (e. g. dependence on the choice of move set or 
large degeneracy of the native state). Moreover, a re- 
quirement of these observables is that the native state (s) 
being effectively known to give meaningful results. Thus, 
it still remains a challenge to "design" a (computation- 
ally affordable) structural observable which reflects the 
cooperative rearrangements and activity indicated by the 
peaks or shoulders in the specific heats at low T. 

Based on the ideas of the "DNA walk" ^ we propose 
here a scalar structural observable which might provide 
better insight into the internal topological changes tak- 
ing place upon folding. We define the tortuosity (or 
writhing/ winding measure) of the protein as 

CAr-2 \ 
^E(^^ ' (10) 
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FIG. 5. (Color online) Main figure: Same as Fig. S] for an 
HP 67mer on the simple cubic lattice (3D).— The inset figure 
shows the canonical energy distribution, p{E), at the tem- 
perature where C takes its maximum in the sharp peak. Be- 
low, typical structures are shown at indicated temperatures 
and energies (/cT/e, E). The native state of this sequence 
(^min = —56) resembles an a//3-barrel in real proteins. The 
ground state and the structure with ^ = — 48 correspond 
to the two maxima of the bimodal distribution of p{E) in 
the inset figure illustrating the significant conformational re- 
arrangements taking place at the folding transition. Each of 
the small figures at the bottom shows the "winding walks" [si 
vs i, see Eq. (|lQp ] of 10 structures at corresponding energies. 
For further explanations, see text. 



where 



l<i< N -2. 



(11) 



Vjj-^i and denote the two-dimensional vectors be- 

tween monomers (j, j + 1) and (j, j + 2), respectively 
(in the plane spanned by these three monomers). Their 
cross product determines whether two consecutive bonds 
define a left (—1) or right (+1) turn or are collinear (0)^^ 
Si is the running sum of bond turns along the polymer 
chain ("winding walk"); examples of Si for some confor- 
mations at different energies are shown at the bottom of 
Fig. [5l The standard deviation of s^, here denoted as r, 
is thus a measure of polymer tortuosity. (Note that r is 
invariant to the direction of the running sum Si along the 
chain.) 



The blue curves in Figs. HI [5] and [6] display r as a 
function of T. Overall, they reveal the following picture: 
By lowering T, the associated contraction of the protein 
results in a reduction of conformational degrees of free- 
dom and, thus, in a decrease of "winding freedom" as 
well. At a certain point in the low temperature regime, 
r undergoes a significant change, showing a sharp drop- 
off (increase) or passing over a rounded peak (through). 
The diversity of signals is a consequence of the different 
sequences of H and P residues as well as the different 
chain lengths. But the important observation is that the 
temperature region of this change clearly does not co- 
incide with the collapse transition (where the protein's 
shape is still random) but with the low temperature sig- 
nal in the specific heat (i.e. the folding transition). At 
this temperature, internal structural rearrangements take 
place (e. g. breaking of HH contacts of compact denatured 
states) which eventually allow the protein to assume its 
native state. Being sensitive to exactly such topologi- 
cal transformations, r reflects this folding behavior. It 
serves, thus, as a complementary structural quantity to 
better interpret the thermodynamic activity observed in 
the specific heats. It might be conjectured that the shape 
of r could even give some deeper insight into the fold- 
ing characteristics, such as a sharp drop of r indicating 
a folding funnel or a peak signaling a folding barrier. 
Furthermore, the standard deviation is just one (simple) 
means of analyzing the "winding walks" Si and other, 
more appropriate, measures could be conceived. These 
questions are left for further research. 



C. Random vs protein-like HP heteropolymers 

Many of the above HP sequences (Figs. 21 [5] and [6]) 
have been derived from real proteins and mimic protein- 
like behavior exhibiting some kind of collapse and fold- 
ing transitions. However, for these relatively short chains 
{N < 100), the characteristics and strengths of the two 
transitions are highly sequence dependent and subject 
to strong surface effects in the folded state (for a perfect 
cube of = 125, only about 20% of the monomers are in 
the bulk). Although there is no thermodynamic limit in 
the exact statistical sense for any single heteropolymer, 
here we are interested in identifying the generic differ- 
ences between random and protein-like HP sequences in 
the long-chain limit. 

To address this question, we have studied several long 
{N = 500) HP chains, each with 50% H and 50% P 
monomers in either random or protein-like sequence. The 
latter have been constructed as follows)^ Starting from 
a compact, globular homopolymer conformation (i. e. 
only H), the 50% of monomers which are farthest from 
the center of mass are marked as P (polar). Repeat- 
ing this process for different homopolymer conformations 
yields protein-like HP sequences with distinct hydropho- 
bic cores. Longer HP polymers allowed us also to further 
test the efficiency of our procedure. By splitting up the 
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FIG. 6. (Color online) Same as Fig.|4]for several benchmark HP sequences on the simple cubic lattice (3D). For sequence 3D88, 
the dashed line shows the specific heat (C/N) calculated from the DOS with energy range [—69, 0] only. For sequences 3D103 
and 3D136, the DOS/observables have been only obtained for the energy ranges [—57,0] and [—82,0], respectively, because of 
the difficulty to sample the respective ground states. Since r is very sensitive at low T, for sequences with N > 100, structural 
observables have been averaged from 50 independent WL production runs. 



calculation of the DOS into two unequal energy windows 
(the lower one covering only 10-15% of the entire en- 
ergy interval with sufficient overlap to the larger window 
by e.g. 25 energy levels), we could accurately estimate 
the DOS over more than 95% of the entire energy range 
of these sequences within a couple of CPU days. Note 
that for these chain lengths, finding the ground state is 
currently still out of reach for any method due to the 
exponentially increasing computational complexity; but 
this is not a requirement here. 

An important dissimilarity between random and 
protein-like sequences becomes already visible in the 
DOS: For given ratios of H and P monomers, both types 
of sequences span roughly the same energy range (here 
^min ^ —400) but the DOS is steeper at low energies 
for protein-like sequences. This behavior is in agreement 
with the notion of folding funneli where the occupied 



conformational space of a protein abruptly reduces when 
approaching the native state. 

Another difference arises in their thermodynamic and 
structural behaviors. As shown in Fig. [7] the collapse 
transition for the protein-like heteropolymer is signifi- 
cantly shifted towards higher T as compared to the ran- 
dom sequence (Tc ~ 1.2 and Tc ~ 0.9, respectively). This 
can be rationalized by the "design" of the protein-like se- 
quences favoring the formation of hydrophobic cores al- 
ready at Tc, with a significantly lower energy compared to 
the random case (see the corresponding structures with 
energies —240 and —193, respectively). Interestingly, 
however, among sequences of the same type (i. e. either 
random or protein-like), Tc shows only little variation. 
For both sequences, the drop-off of Rg coincides generally 
better with the upper specific heat peak (collapse tran- 
sition) as compared to the shorter HP sequences studied 
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FIG. 7. (Color online) Comparison of thermodynamic (C) 
and structural (Rg^r) properties between a random (top) and 
protein-like (bottom) HP heteropolymer. Both sequences con- 
stitute 500 monomers with a 50% : 50% ratio among H and P. 
For both sequences, typical structures are shown at indicated 
temperatures and energies (kT/e, E). 



above. 

The specific heat of the protein-like sequence shows a 
clear separation between collapse and folding transition 
(although the latter is often manifested by a shoulder 
only) exemplifying a rough but structured energy land- 
scape. The course of the winding measure (r) corrob- 
orates this observation signaling a division between col- 
lapse and folding phase. In contrast, for the random 
heteropolymer, the two transitions are more interleaved, 



and the collapse transition is largely suppressed by the 
generally stronger transition to the ground state. This 
is a consequence of the inability to form a hydrophobic 
core and the absence of conformational "guidance" (fold- 
ing path) towards the ground state. Although r features 
also a more or less pronounced drop-off, its occurrence 
does not coincide with the low T peak in the specific heat 
anymore (as observed for some of the shorter protein-like 
HP sequences above). 

A systematic analysis of these observations, by consid- 
ering a large ensemble of HP sequences, would be both a 
computational challenge and an interesting topic to bet- 
ter understand the generic behavior of these model pro- 
teins. 



IV. CONCLUSION 

We have presented a generic Monte Carlo scheme based 
on Wang-Landau sampling with suitable trial moves 
(combination of pull, bond-rebridging and pivot moves) 
which is very efficient for the simulation of HP lattice 
proteins and the analysis of their thermodynamic prop- 
erties over the entire temperature range. Because of its 
intrinsic simplicity and flexibility, our method is ideally 
suited for the many variations of this common protein 
model, e.g. protein adsorption, ^^'''^ protein aggregation 
and the like. For instance, with our present, optimized, 
algorithm, it takes only about 5.5 hours (^dos) to obtain 
the entire DOS of an HP 36mer adsorbing on a weakly 
attracting surface (i. e. between one to two orders of mag- 
nitude faster than in Ref . IH) . This system has been used 
as a benchmark to demonstrate the efficiency of a re- 
cently proposed Wang-Landau scheme coupled with con- 
figurational bias Monte Carlo which achieved toos ^ 28h 
under the same conditions.^ Besides considerable higher 
performance, the main advantage of our approach is that 
its efficiency does not depend on the tuning of exter- 
nal simulation parameters (e. g. an unphysical tempera- 
ture as in Ref. M)- Our approach has also proven pow- 
erful for exploring the low-temperature thermodynamics 
of interacting self- avoiding walks, even for chain lengths 

Among the various ingredients used to tackle simula- 
tions of lattice proteins or polymers at low temperatures 
and high densities, ranging from the Monte Carlo driver 
to various tuning schemes, it turned out that the use 
of appropriate Monte Carlo trial moves has the greatest 
impact on the overall performance and accuracy of the 
algorithm. However, we stress that it is the interplay 
between Wang-Landau sampling, trial move set and ef- 
ficient implementation which ultimately resulted in the 
overall robustness and performance of our approach. 

Substantial further efficiency improvements would 
likely be gained with suitable parallelization techniques. 
Subdivision of the sampling of the DOS into smaller 
energy windows or the use of multiple random walkers 
which simultaneously update a single DOS are two al- 
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ready well studied approaches in this direction.-^ A more 
advanced and interesting parallelization scheme would al- 
low exchange of conformations among several indepen- 
dently running Wang-Landau samplers, similar to the 
ideas employed in parallel tempering. Such "conforma- 
tional mixing" could considerably reduce the tunneling 
time of a random walk between the energy boundaries 
and thus speed up the overall convergence of the simula- 



tion. 
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